Instruction¶

I have a typical project of predicting NYC Uber/Lyft trip demand. The dataset is available from January 2022 to March 2023. The area is already divided into different locations, and I want the predicted demand for each location every 15 minutes.

Problem statement¶

The goal of this project is to predict the demand for Uber/Lyft trips in different locations of NYC every 15 minutes, using a dataset spanning January 2022 to March 2023. The dataset includes information such as the dispatching base number, pickup datetime, drop-off datetime, pickup location ID, drop-off location ID, SR_Flag, and affiliated base number.
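As a sketch of the prediction target, trip records can be aggregated into 15-minute counts per pickup location with pandas. The column names follow the dataset description above; the rows here are synthetic, and real data would span the full 15 months:

```python
import pandas as pd

# Synthetic trips: two pickup locations, timestamps within one hour
trips = pd.DataFrame({
    'pickup_datetime': pd.to_datetime([
        '2022-01-01 00:03', '2022-01-01 00:07', '2022-01-01 00:20',
        '2022-01-01 00:31', '2022-01-01 00:48', '2022-01-01 00:59',
    ]),
    'PUlocationID': [12, 12, 12, 89, 89, 12],
})

# Count trips per location per 15-minute bin -> the quantity to forecast
demand = (
    trips.set_index('pickup_datetime')
         .groupby('PUlocationID')
         .resample('15min')
         .size()
         .rename('trip_count')
         .reset_index()
)
```

Empty bins show up as zero counts, which matters for the time-series models later: gaps in the raw records still need a row in the modeled series.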

In [1]:
import glob

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import tqdm
from dateutil.relativedelta import relativedelta
from pmdarima import auto_arima
from statsmodels.tsa.arima.model import ARIMA
In [2]:
# Uses the glob.glob function to retrieve a list of file paths that match the specified 
# pattern 'Datasets/fhv_tripdata_2022-2023_in_csv/*.csv'. 
# This pattern is used to find all CSV files in the given directory.
data_list_path = glob.glob('Datasets/fhv_tripdata_2022-2023_in_csv/*.csv')

# Initializes an empty list called list_df to store the DataFrames.
list_df = []
# Iterates over each file path in data_list_path.
for path in data_list_path:
    print(path)
    # Step 1: Preprocess the Dataset
    # inside the loop, it reads each CSV file using pd.read_csv and assigns it to the variable df.
    df = pd.read_csv(path)
    # Appends the DataFrame to the list_df list.
    list_df.append(df)
    
# After the loop, it concatenates all the DataFrames in list_df into a single DataFrame using pd.concat. 
# The concatenated DataFrame is assigned to the variable df
df = pd.concat(list_df)

# Specifies the columns to keep ('pickup_datetime' and 'PUlocationID')
# in the list interested_features.
interested_features = ['pickup_datetime','PUlocationID']
# Updates df to contain only the columns specified in interested_features using indexing.
df = df[interested_features]


# Summary :
# Overall, this code reads multiple CSV files from the specified directory, 
# concatenates them into a single DataFrame, and then selects and keeps only the columns specified in interested_features
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-09.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-02.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-04.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-07.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-01.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-06.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-08.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2023-03.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-11.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-12.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2023-02.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-03.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2023-01.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-05.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-10.csv
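Since only two of the dataset's columns are kept, passing `usecols` to `pd.read_csv` would skip parsing the rest and cut memory up front instead of after concatenation. A minimal sketch on synthetic files; the extra `dispatching_base_num` header is an assumption standing in for the unused columns:

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small CSV files standing in for the monthly trip files
tmpdir = tempfile.mkdtemp()
cols = ['dispatching_base_num', 'pickup_datetime', 'PUlocationID']
pd.DataFrame([['B001', '2022-01-01 00:05', 12],
              ['B002', '2022-01-01 00:12', 89]], columns=cols) \
  .to_csv(os.path.join(tmpdir, 'fhv_tripdata_2022-01.csv'), index=False)
pd.DataFrame([['B003', '2022-02-01 09:30', 12]], columns=cols) \
  .to_csv(os.path.join(tmpdir, 'fhv_tripdata_2022-02.csv'), index=False)

# usecols parses only the needed columns, so the unused ones
# never occupy memory; ignore_index avoids duplicate row labels
paths = sorted(glob.glob(os.path.join(tmpdir, '*.csv')))
frames = [pd.read_csv(p, usecols=['pickup_datetime', 'PUlocationID'])
          for p in paths]
df = pd.concat(frames, ignore_index=True)
```

With ~17.7 million rows across the monthly files, dropping unused columns at read time is noticeably cheaper than loading everything and subsetting afterwards.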
In [3]:
# The code imports the necessary libraries:
import pandas as pd
import pmdarima as pm
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split

# Prints the number of rows in the DataFrame df before removing rows with NaN values
# This line uses the .shape[0] attribute of a DataFrame to retrieve the number of rows.
print('Number of Rows Before Removing NaN:', df.shape[0])
# Removes rows with NaN values from the DataFrame df and assigns the result to removed_nan_df:
removed_nan_df = df.dropna()
# The .dropna() method removes rows containing any NaN values.
# The resulting DataFrame with NaN rows removed is assigned to removed_nan_df.
print('Number of Rows After Removing NaN:', removed_nan_df.shape[0])
Number of Rows Before Removing NaN: 17712727
Number of Rows After Removing NaN: 4164902
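Roughly three quarters of the rows are dropped here, so it is worth checking which column drives the loss before discarding them. A small sketch using `df.isna().sum()` on toy data shaped like the two retained columns:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two retained columns, with missing values
df = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(
        ['2022-01-01 00:05', '2022-01-01 00:12', '2022-01-01 00:20', None]),
    'PUlocationID': [12.0, np.nan, 89.0, 12.0],
})

# Per-column NaN counts show which field causes the row loss
nan_counts = df.isna().sum()

# dropna removes any row with at least one missing value
removed = df.dropna()
```

On the real data, a large share of missing `PUlocationID` values would mean most of the loss is unavoidable for a per-location model; missing timestamps would be a different (and more suspicious) problem.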
In [4]:
# Retrieves unique values from the 'PUlocationID' column in the 
# DataFrame removed_nan_df and converts them to a list:
location_ids = removed_nan_df['PUlocationID'].unique().tolist()

# Initializes a loop counter variable:
loop_count = 0
# Iterates over each unique location ID in location_ids:
for lc_id in location_ids:
    # Prints the current location ID:

    print('Location ID:', lc_id)
    # Filters removed_nan_df to create a subset DataFrame (df_subset)
    # containing rows with a specific 'PUlocationID'. The .copy() avoids
    # pandas' SettingWithCopyWarning on the assignment below:
    df_subset = removed_nan_df[removed_nan_df['PUlocationID'] == lc_id].copy()
    # Converts the 'pickup_datetime' column in df_subset to datetime format using pd.to_datetime:
    df_subset['pickup_datetime'] = pd.to_datetime(df_subset['pickup_datetime'])
    # Sorts df_subset based on the 'pickup_datetime' column:
    df_subset = df_subset.sort_values('pickup_datetime')
    # Sets the 'pickup_datetime' column as the index of df_subset:
    df_subset = df_subset.set_index('pickup_datetime')
    # Resamples df_subset at a frequency of 1 hour ('1H') and counts the number of trips per hour.
    # Note: the project goal is 15-minute demand; changing '1H' to '15min' yields that
    # granularity, at the cost of a longer and sparser series per location:
    df_subset = df_subset['PUlocationID'].resample('1H').count()
    # Resets the index of df_subset to convert the index (time) back into a column:
    df_subset = df_subset.reset_index()
    # Split data into training and testing sets
    # Determines the train-test split based on the length of df_subset,
    # with 95% of the data used for training and the remaining 5% for testing:
    train_size = int(len(df_subset) * 0.95)
    train_data = df_subset[:train_size]
    test_data = df_subset[train_size:]

    # Perform auto ARIMA on training data.
    # Uses the pm.auto_arima function from the pmdarima library to automatically
    # search for the best ARIMA order. Note: seasonal=True has no effect with the
    # default period m=1; for hourly data with a daily cycle, pass m=24
    # (at considerable extra fitting cost):
    model = pm.auto_arima(train_data['PUlocationID'], seasonal=True, trace=True)

    # Generate predictions
    # Generates predictions (forecast) using the ARIMA model and
    # calculates the confidence interval (conf_int) for the length of the test data
    forecast, conf_int = model.predict(n_periods=len(test_data), return_conf_int=True)

    # Create a dataframe for predictions and actual values
    result_df = pd.DataFrame({
        'pickup_datetime': test_data['pickup_datetime'],
        'Actual': test_data['PUlocationID'],
        'Forecast': forecast
    })
    # Save the dataframe to a CSV file (assumes the arima-results/ directory already exists)
    filename = f"arima-results/{lc_id}_data.csv"
    result_df.to_csv(filename, index=False)

    # Plotting
    # Creates a Plotly figure (fig) and adds traces for the training data, testing data, and ARIMA forecast:
    fig = go.Figure()
    # The 'pickup_datetime' column is used for the x-axis: after reset_index the
    # DataFrame index is a plain RangeIndex, not timestamps.
    fig.add_trace(go.Scatter(x=train_data['pickup_datetime'], y=train_data['PUlocationID'], mode='lines+markers', name='Training Data'))
    fig.add_trace(go.Scatter(x=test_data['pickup_datetime'], y=test_data['PUlocationID'], mode='lines+markers', name='Testing Data'))
    fig.add_trace(go.Scatter(x=test_data['pickup_datetime'], y=forecast, mode='lines+markers', name='ARIMA Forecast'))
    # Updates the layout of the figure with a title and axis labels:

    fig.update_layout(title=f'Pickup Location ID: {lc_id}', xaxis_title='Time', yaxis_title='Number of Trips')
    # Displays the figure:
    fig.show()
    loop_count += 1
    # Stops after two locations as a demonstration; remove this check to process every location:
    if loop_count > 1:
        break

# Summary
# To run the analysis for every location, remove the loop-counter break above.
# Overall, this code performs time series analysis and forecasting for each location ID,
# giving a view of the demand patterns and trends at different locations.
Location ID: 12.0
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(0,0,0)[0] intercept   : AIC=-11849.933, Time=18.91 sec
 ARIMA(0,0,0)(0,0,0)[0] intercept   : AIC=-11825.381, Time=4.77 sec
 ARIMA(1,0,0)(0,0,0)[0] intercept   : AIC=-11844.742, Time=10.76 sec
 ARIMA(0,0,1)(0,0,0)[0] intercept   : AIC=-11843.762, Time=4.85 sec
 ARIMA(0,0,0)(0,0,0)[0]             : AIC=-11650.357, Time=0.41 sec
 ARIMA(1,0,2)(0,0,0)[0] intercept   : AIC=-11854.674, Time=35.79 sec
 ARIMA(0,0,2)(0,0,0)[0] intercept   : AIC=-11847.957, Time=7.66 sec
 ARIMA(1,0,1)(0,0,0)[0] intercept   : AIC=-11846.164, Time=6.36 sec
 ARIMA(1,0,3)(0,0,0)[0] intercept   : AIC=-11844.097, Time=18.47 sec
 ARIMA(0,0,3)(0,0,0)[0] intercept   : AIC=-11847.073, Time=9.55 sec
 ARIMA(2,0,1)(0,0,0)[0] intercept   : AIC=-11854.373, Time=21.05 sec
 ARIMA(2,0,3)(0,0,0)[0] intercept   : AIC=-11853.703, Time=107.06 sec
 ARIMA(1,0,2)(0,0,0)[0]             : AIC=inf, Time=9.08 sec

Best model:  ARIMA(1,0,2)(0,0,0)[0] intercept
Total fit time: 254.856 seconds
Location ID: 89.0
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=50398.076, Time=44.34 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=51768.990, Time=1.06 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=50564.942, Time=2.41 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=50454.990, Time=3.87 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=51766.991, Time=0.28 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=50.57 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=50447.109, Time=8.11 sec
 ARIMA(3,1,2)(0,0,0)[0] intercept   : AIC=50260.815, Time=33.35 sec
 ARIMA(3,1,1)(0,0,0)[0] intercept   : AIC=50442.066, Time=15.09 sec
 ARIMA(4,1,2)(0,0,0)[0] intercept   : AIC=50238.059, Time=57.19 sec
 ARIMA(4,1,1)(0,0,0)[0] intercept   : AIC=50429.196, Time=15.57 sec
 ARIMA(5,1,2)(0,0,0)[0] intercept   : AIC=50165.209, Time=65.94 sec
 ARIMA(5,1,1)(0,0,0)[0] intercept   : AIC=50401.097, Time=18.52 sec
 ARIMA(5,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=88.14 sec
 ARIMA(4,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=70.16 sec
 ARIMA(5,1,2)(0,0,0)[0]             : AIC=50163.210, Time=11.05 sec
 ARIMA(4,1,2)(0,0,0)[0]             : AIC=50236.059, Time=7.19 sec
 ARIMA(5,1,1)(0,0,0)[0]             : AIC=50399.097, Time=4.14 sec
 ARIMA(5,1,3)(0,0,0)[0]             : AIC=inf, Time=13.49 sec
 ARIMA(4,1,1)(0,0,0)[0]             : AIC=50427.197, Time=1.92 sec
 ARIMA(4,1,3)(0,0,0)[0]             : AIC=inf, Time=8.28 sec

Best model:  ARIMA(5,1,2)(0,0,0)[0]          
Total fit time: 520.790 seconds
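The saved result files pair actual and forecast counts per timestamp, so forecast quality can be summarized per location with MAE and RMSE. A sketch on synthetic values; `result_df` mirrors the columns written to the CSVs above, but these numbers are made up:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one saved result file (actual vs. forecast counts)
result_df = pd.DataFrame({
    'Actual':   [10.0, 12.0, 9.0, 14.0],
    'Forecast': [11.0, 10.0, 9.0, 12.0],
})

errors = result_df['Actual'] - result_df['Forecast']

# Mean absolute error: average size of the miss, in trips
mae = errors.abs().mean()

# Root mean squared error: penalizes large misses more heavily
rmse = np.sqrt((errors ** 2).mean())
```

Computing these per location would make the two fitted models comparable on a common scale and flag locations where a plain ARIMA underperforms, e.g. where the ignored daily seasonality matters most.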